Abstract. In probabilistic grammatical inference, a usual goal is to infer a good
approximation of an unknown distribution P called a stochastic language. The
estimate of P stands in some class of probabilistic models such as probabilistic
automata (PA). In this paper, we focus on probabilistic models based on
multiplicity automata (MA). The stochastic languages generated by MA are called
rational stochastic languages; they strictly include the stochastic languages
generated by PA and admit a very concise canonical representation. Although this
class is not recursively enumerable, it is efficiently identifiable in the limit
by the algorithm DEES, introduced by the authors in a previous paper. However, the
identification is not proper: before convergence, DEES can produce MA that do not
define stochastic languages. Nevertheless, it is possible to use these MA to define
stochastic languages. We show that they belong to a broader class of rational
series, which we call pseudo-stochastic rational languages. The aim of this paper is
twofold. First, we provide a theoretical study of pseudo-stochastic rational
languages, the languages output by DEES, showing for example that this class is
decidable within polynomial time. Second, we have carried out a series of
experiments in order to compare DEES to classical inference algorithms such as
ALERGIA and MDI. They show that DEES outperforms them in most cases.

Keywords: pseudo-stochastic rational languages, multiplicity automata,
probabilistic grammatical inference.

1 Introduction 

In probabilistic grammatical inference, we often consider stochastic languages which
define distributions over $\Sigma^*$, the set of all the possible words over an alphabet
$\Sigma$. In general, we consider an unknown distribution P and the goal is to find a good
approximation given a finite sample of words independently drawn from P.

The class of probabilistic automata (PA) is often used for modeling such distributions.
This class has the same expressiveness as Hidden Markov Models and is identifiable in
the limit [4]. However, there exists no efficient algorithm for identifying PA. This can
be explained by the fact that there exists no canonical representation of these automata,
which makes it difficult to correctly identify the structure of the target. One solution
is to focus on subclasses of PA such as probabilistic deterministic automata [3,9], but
at the cost of an important loss of expressiveness. Another solution consists in
considering the class of multiplicity automata (MA). These models admit a canonical
representation which offers good opportunities from a machine learning point of view. MA
define functions that compute rational series with values in $\mathbb{R}$ [5]. MA are a
strict generalization of PA, and the stochastic languages generated by PA are special
cases of rational stochastic languages. Let us denote by $\mathcal{S}_K^{rat}(\Sigma)$ the
class of rational stochastic languages computed by MA with parameters in $K$, where
$K \in \{\mathbb{Q}, \mathbb{Q}^+, \mathbb{R}, \mathbb{R}^+\}$. With $K = \mathbb{Q}^+$ or
$K = \mathbb{R}^+$, $\mathcal{S}_K^{rat}(\Sigma)$ is exactly the class of stochastic
languages generated by PA with parameters in $K$. But when $K = \mathbb{Q}$ or
$K = \mathbb{R}$, we obtain strictly greater classes. This provides several advantages:
elements of $\mathcal{S}_K^{rat}(\Sigma)$ have a minimal normal representation, thus
elements of $\mathcal{S}_{K^+}^{rat}(\Sigma)$ may have a significantly smaller
representation in $\mathcal{S}_K^{rat}(\Sigma)$; parameters of these minimal
representations are directly related to probabilities of some natural events of the form
$u\Sigma^*$, which can be efficiently estimated from stochastic samples; lastly, when $K$
is a field, rational series over $K$ form a vector space and efficient linear algebra
techniques can be used to deal with rational stochastic languages.

However, the class $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ presents a serious drawback:
there exists no recursively enumerable subset of MA which exactly generates it [4]. As a
consequence, no proper identification algorithm can exist: indeed, applying a proper
identification algorithm to an enumeration of samples of $\Sigma^*$ would provide an
enumeration of the class of rational stochastic languages over $\mathbb{Q}$. In spite of
this result, there exists an efficient algorithm, DEES, which is able to identify
$\mathcal{S}_K^{rat}(\Sigma)$ in the limit. But before reaching the target, DEES can
produce MA that do not define stochastic languages. However, it has been shown in [6]
that with probability one, for any rational stochastic language $p$, if DEES is given as
input a sufficiently large sample $S$ drawn according to $p$, DEES outputs a rational
series $r$ such that $\sum_{u \in \Sigma^*} r(u)$ converges absolutely to 1. Moreover,
$\sum_{u \in \Sigma^*} |p(u) - r(u)|$ converges to 0 as the size of $S$ increases. We show
that these MA belong to a broader class of rational series, that we call
pseudo-stochastic rational languages. A pseudo-stochastic rational language $r$ has the
property that $r(u\Sigma^*) = \lim_{n \to \infty} r(u\Sigma^{\le n})$ is defined for any
word $u$ and that $r(\Sigma^*) = 1$. A stochastic language $p_r$ can be associated with
$r$ in such a way that

$$\sum_{u \in \Sigma^*} |p_r(u) - r(u)| = 2 \sum_{r(u) < 0} |r(u)|$$

when the sum $\sum_{u \in \Sigma^*} r(u)$ is absolutely convergent. As a first
consequence, $p_r = r$ when $r$ is a stochastic language. As a second consequence, for
any rational stochastic language $p$, if DEES is given as input increasing samples drawn
according to $p$, DEES outputs pseudo-stochastic rational languages $r$ such that
$\sum_{u \in \Sigma^*} |p(u) - p_r(u)|$ converges to 0 as the size of $S$ increases.

The aim of this paper is twofold: to provide a theoretical study of the class of
pseudo-stochastic rational languages, and to report a series of experiments comparing
the performance of DEES to two classical inference algorithms, ALERGIA [3] and MDI [9].
We show that the class of pseudo-stochastic rational languages is decidable within
polynomial time. We provide an algorithm that can be used to compute $p_r(u)$ from any MA
that computes $r$. We also show how it is possible to simulate $p_r$ using such an
automaton. We show that there exist pseudo-stochastic rational languages $r$ such that
$p_r$ is not rational. Finally, we show that it is undecidable whether two
pseudo-stochastic rational languages define the same stochastic language. Our
experiments show that DEES outperforms ALERGIA and MDI in most cases. These results were
expected, since ALERGIA and MDI do not have the same theoretical expressiveness and since
DEES aims at producing a minimal representation of the target in the set of MA, which
can be significantly smaller than the smallest equivalent PDA (if it exists).

The paper is organized as follows. In Section 2, we introduce some background on
multiplicity automata, rational series and stochastic languages, and present the
algorithm DEES. Section 3 deals with our study of pseudo-stochastic rational languages.
Our experiments are detailed in Section 4.

2 Definitions and notations 

2.1 Rational series, multiplicity automata and stochastic languages 

Let $\Sigma^*$ be the set of words on the finite alphabet $\Sigma$. A language is a
subset of $\Sigma^*$. The empty word is denoted by $\varepsilon$ and the length of a word
$u$ is denoted by $|u|$. For any integer $k$, let $\Sigma^k = \{u \in \Sigma^* : |u| = k\}$
and $\Sigma^{\le k} = \{u \in \Sigma^* : |u| \le k\}$. We denote by $<$ the
length-lexicographic order on $\Sigma^*$ and by $Min\,L$ the minimal element of a non
empty language $L$ according to this order. A subset $S$ of $\Sigma^*$ is prefix-closed if
for any $u, v \in \Sigma^*$, $uv \in S \Rightarrow u \in S$. For any $S \subseteq \Sigma^*$,
let $pref(S) = \{u \in \Sigma^* : \exists v \in \Sigma^*, uv \in S\}$ and
$fact(S) = \{v \in \Sigma^* : \exists u, w \in \Sigma^*, uvw \in S\}$.

A formal power series is a mapping $r$ of $\Sigma^*$ into $K$. The set of all formal
power series is denoted by $K\langle\langle\Sigma\rangle\rangle$. It is a vector space.
For any series $r$ and any word $u$, let us denote by $\dot{u}r$ the series defined by
$\dot{u}r(w) = r(uw)$ for every word $w$. Let us denote by $supp(r)$ the support of $r$,
i.e. the set $\{w \in \Sigma^* : r(w) \neq 0\}$. A stochastic language is a formal series
$p$ which takes its values in $\mathbb{R}^+$ and such that
$\sum_{w \in \Sigma^*} p(w) = 1$. The set of all stochastic languages over $\Sigma$ is
denoted by $\mathcal{S}(\Sigma)$. For any language $L \subseteq \Sigma^*$ and any
$p \in \mathcal{S}(\Sigma)$, let us denote $\sum_{w \in L} p(w)$ by $p(L)$. For any
$p \in \mathcal{S}(\Sigma)$ and $u \in \Sigma^*$ such that $p(u\Sigma^*) \neq 0$, the
residual language of $p$ wrt $u$ is the stochastic language $u^{-1}p$ defined by
$u^{-1}p(w) = p(uw)/p(u\Sigma^*)$. We denote by $res(p)$ the set
$\{u \in \Sigma^* : p(u\Sigma^*) \neq 0\}$ and by $Res(p)$ the set
$\{u^{-1}p : u \in res(p)\}$.

Let $S$ be a sample over $\Sigma^*$, i.e. a multiset composed of words over $\Sigma^*$.
We denote by $p_S$ the empirical distribution over $\Sigma^*$ associated with $S$. Let
$S$ be an infinite sample composed of words independently drawn according to a stochastic
language $p$. We denote by $S_n$ the sequence composed of the $n$ first words of $S$.
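As an illustration of these definitions, the following Python sketch computes the empirical distribution $p_S$, the prefix masses $p_S(u\Sigma^*)$ and the residuals $u^{-1}p_S$ of a finite sample. It is only a sketch under assumptions: the sample is represented as a `Counter` mapping words to multiplicities, the empty word is the empty string, and the function names `empirical`, `prefix_mass` and `residual` are hypothetical helpers, not part of DEES. The sample is the one used in the worked example of Section 2.2.

```python
from collections import Counter
from fractions import Fraction


def empirical(sample):
    """Return the pair (p, prefix_mass) associated with a multiset of words."""
    total = sum(sample.values())

    def p(word):
        # p_S(word): relative frequency of the word itself
        return Fraction(sample.get(word, 0), total)

    def prefix_mass(u):
        # p_S(u Sigma^*): relative frequency of the words having u as a prefix
        return Fraction(sum(c for w, c in sample.items() if w.startswith(u)), total)

    return p, prefix_mass


def residual(sample, u, w):
    """u^{-1} p_S(w) = p_S(uw) / p_S(u Sigma^*), defined when p_S(u Sigma^*) != 0."""
    p, prefix_mass = empirical(sample)
    return p(u + w) / prefix_mass(u)


# Sample with multiplicities: epsilon x10, a x20, aa x20, aaa x10
S = Counter({"": 10, "a": 20, "aa": 20, "aaa": 10})
p, prefix_mass = empirical(S)
print(p(""), prefix_mass("a"), prefix_mass("aa"), residual(S, "a", "a"))
```

Exact fractions are used so that the values can be compared directly with hand computations.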

We now introduce the notion of multiplicity automata (MA). Let
$K \in \{\mathbb{R}, \mathbb{Q}, \mathbb{R}^+, \mathbb{Q}^+\}$. A $K$-multiplicity
automaton (MA) is a 5-tuple $\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ where $Q$ is
a finite set of states, $\varphi : Q \times \Sigma \times Q \to K$ is the transition
function, $\iota : Q \to K$ is the initialization function and $\tau : Q \to K$ is the
termination function. We extend the transition function $\varphi$ to
$Q \times \Sigma^* \times Q$ by
$\varphi(q, wx, r) = \sum_{s \in Q} \varphi(q, w, s)\varphi(s, x, r)$ and
$\varphi(q, \varepsilon, r) = 1$ if $q = r$ and 0 otherwise, for any $q, r \in Q$,
$x \in \Sigma$ and $w \in \Sigma^*$. For any finite subset $L \subseteq \Sigma^*$ and any
$R \subseteq Q$, define $\varphi(q, L, R) = \sum_{w \in L, r \in R} \varphi(q, w, r)$. We
denote by $Q_I = \{q \in Q \mid \iota(q) \neq 0\}$ the set of initial states and by
$Q_T = \{q \in Q \mid \tau(q) \neq 0\}$ the set of terminal states. A state $q \in Q$ is
accessible (resp. co-accessible) if there exists $q_0 \in Q_I$ (resp. $q_t \in Q_T$) and
$u \in \Sigma^*$ such that $\varphi(q_0, u, q) \neq 0$ (resp.
$\varphi(q, u, q_t) \neq 0$). An MA is trimmed if all its states are accessible and
co-accessible. From now on, we only consider trimmed MA. The support of an MA
$A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ is the Non-deterministic Finite
Automaton (NFA) $\langle \Sigma, Q, Q_I, Q_T, \delta \rangle$ where
$\delta(q, x) = \{q' \in Q \mid \varphi(q, x, q') \neq 0\}$.

Amaury Habrard, Francois Denis, and Yann Esposito

The spectral radius of a square matrix $M$ is the maximum magnitude of its eigenvalues.
Let $A = \langle \Sigma, Q = \{q_1, \ldots, q_n\}, \varphi, \iota, \tau \rangle$ be an MA.
Let us denote by $\rho(A)$ the spectral radius of the square matrix
$[\varphi(q_i, \Sigma, q_j)]_{1 \le i,j \le n}$ ($\rho(A)$ does not depend on the order of
the states). If $\rho(A) < 1$ then each sequence $r_{A,q}(\Sigma^{\le n})$ converges to a
number $s_q$ and hence, $r_A(\Sigma^{\le n})$ converges too [6]. Let us denote by
$r_A(\Sigma^*)$ the limit of $r_A(\Sigma^{\le n})$ when it exists. The numbers $s_q$ are
the unique solutions of the following linear system of equations (and therefore are
computable within polynomial time):

$$s_q = r_{A,q}(\varepsilon) + \sum_{q' \in Q} \varphi(q, \Sigma, q')\, s_{q'} \quad \text{for } q \in Q.$$

It is decidable within polynomial time whether $\rho(A) < 1$ [2,7].
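The linear system above can be solved directly. The sketch below is one possible way to do it in Python, under stated assumptions: the MA is encoded by a dictionary `phi` with `phi[x][i][j]` standing for $\varphi(q_i, x, q_j)$ and a list `tau`, the term $r_{A,q}(\varepsilon)$ is taken to be $\tau(q)$ (which follows from the definition of $r_{A,q}$ given below), and $\rho(A) < 1$ is assumed rather than checked. The helper names `solve_linear` and `state_sums` and the two-state example are hypothetical.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # pick a row with a non-zero pivot (exists when the matrix is invertible)
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def state_sums(phi, tau):
    """Return [s_q] solving s = tau + M s, where M[i][j] = phi(q_i, Sigma, q_j)."""
    n = len(tau)
    M = [[sum(Fraction(phi[x][i][j]) for x in phi) for j in range(n)] for i in range(n)]
    a = [[(1 if i == j else 0) - M[i][j] for j in range(n)] for i in range(n)]
    return solve_linear(a, tau)


# Hypothetical two-state PA over {a, b}: every state satisfies tau(q) + phi(q, Sigma, Q) = 1
F = Fraction
phi = {"a": [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]],
       "b": [[F(0), F(0)], [F(1, 3), F(0)]]}
tau = [F(1, 4), F(1, 3)]
print(state_sums(phi, tau))
```

For a probabilistic automaton every $s_q$ equals 1, which gives a quick sanity check of the solver.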

A Probabilistic Automaton (PA) is a trimmed MA
$\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ s.t. $\iota$, $\varphi$ and $\tau$ take
their values in $[0, 1]$, s.t. $\sum_{q \in Q} \iota(q) = 1$ and for any state $q$,
$\tau(q) + \varphi(q, \Sigma, Q) = 1$. A Probabilistic Deterministic Automaton (PDA) is a
PA whose support is deterministic. It can be shown that Probabilistic Automata generate
stochastic languages. Let us denote by $\mathcal{S}_K^{PA}(\Sigma)$ (resp.
$\mathcal{S}_K^{PDA}(\Sigma)$) the class of all stochastic languages which can be
computed by a PA (resp. a PDA).

For any MA $A$, let $r_A$ be the series defined by
$r_A(w) = \sum_{q, r \in Q} \iota(q)\varphi(q, w, r)\tau(r)$. For any $q \in Q$, we also
define the series $r_{A,q}$ by $r_{A,q}(w) = \sum_{r \in Q} \varphi(q, w, r)\tau(r)$. An
MA $A$ is reduced if the set $\{r_{A,q} \mid q \in Q\}$ is linearly independent in the
vector space $\mathbb{R}\langle\langle\Sigma\rangle\rangle$. An MA $A$ is prefix-closed if
(i) its set of states $Q$ is a prefix-closed subset of $\Sigma^*$, (ii)
$Q_I = \{\varepsilon\}$ and (iii) $\forall u \in Q$, $\delta(\varepsilon, u) = \{u\}$,
where $\delta$ is the transition function in the support of $A$.
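The value $r_A(w)$ can be computed by pushing the initial weights through one transition matrix per letter. The following Python sketch shows this under the same assumed encoding as before (`phi[x][i][j]` for $\varphi(q_i, x, q_j)$, lists `iota` and `tau`); the function name `series_value` and the two-state automaton are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def series_value(iota, phi, tau, word):
    """r_A(word) = sum_{q,r} iota(q) phi(q, word, r) tau(r), computed left to right."""
    n = len(iota)
    # e[j] = sum_q iota(q) phi(q, prefix read so far, q_j)
    e = [Fraction(v) for v in iota]
    for x in word:
        e = [sum(e[i] * phi[x][i][j] for i in range(n)) for j in range(n)]
    return sum(e[j] * tau[j] for j in range(n))


# Hypothetical two-state PA over {a, b}
F = Fraction
iota = [1, 0]
phi = {"a": [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]],
       "b": [[F(0), F(0)], [F(1, 3), F(0)]]}
tau = [F(1, 4), F(1, 3)]
print(series_value(iota, phi, tau, "a"))
```

The cost is linear in the length of the word and quadratic in the number of states.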

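Rational series have several characterizations ([1,8]). Here, we shall say that a formal
power series $r$ over $\Sigma$ is $K$-rational iff there exists a $K$-multiplicity
automaton $A$ such that $r = r_A$, where
$K \in \{\mathbb{R}, \mathbb{R}^+, \mathbb{Q}, \mathbb{Q}^+\}$. Let us denote by
$K^{rat}\langle\langle\Sigma\rangle\rangle$ the set of $K$-rational series over $\Sigma$
and by $\mathcal{S}_K^{rat}(\Sigma) = K^{rat}\langle\langle\Sigma\rangle\rangle \cap \mathcal{S}(\Sigma)$
the set of rational stochastic languages over $K$. It can be shown that a series $r$ is
$\mathbb{R}$-rational iff the set $\{\dot{u}r \mid u \in \Sigma^*\}$ spans a finite
dimensional vector subspace of $\mathbb{R}\langle\langle\Sigma\rangle\rangle$. As a
corollary, a stochastic language $p$ is $\mathbb{R}$-rational iff the set $Res(p)$ spans
a finite dimensional vector subspace $[Res(p)]$ of
$\mathbb{R}\langle\langle\Sigma\rangle\rangle$. Rational stochastic languages have been
studied in [5] from a language theoretical point of view. It is worth noting that
$\mathcal{S}_{\mathbb{R}^+}^{PDA}(\Sigma) \subsetneq \mathcal{S}_{\mathbb{R}^+}^{PA}(\Sigma) = \mathcal{S}_{\mathbb{R}^+}^{rat}(\Sigma) \subsetneq \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$.
From now on, a rational stochastic language will always denote an $\mathbb{R}$-rational
stochastic language.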

Rational stochastic languages have a serious drawback: there exists no recursively
enumerable subset of multiplicity automata capable of generating them [4,5]. As a
consequence, it is undecidable whether a given MA computes a stochastic language.

Every rational language is the support of a rational series but the converse is false:
there exist rational series whose supports are not rational. For example, it can be shown
that the complementary set of $\{a^n b^n \mid n \in \mathbb{N}\}$ in $\{a, b\}^*$ is the
support of a rational series. However, a variant of the Pumping Lemma holds for languages
which are supports of rational series. Let $L$ be such a language. There exists an integer
$N$ such that for any word $w = uv \in L$ satisfying $|v| \ge N$, there exist
$v_1, v_2, v_3$ such that $v = v_1 v_2 v_3$ and $L \cap u v_1 v_2^* v_3$ is infinite [1].

Rational stochastic languages admit a canonical representation by reduced prefix-closed
MA. Let $p$ be a rational stochastic language and let $Q_p$ be the smallest basis of
$[Res(p)]$ (for the order induced by $<$ on the finite subsets of $\Sigma^*$). Let
$A = \langle \Sigma, Q_p, \varphi, \iota, \tau \rangle$ be the MA defined by: (i)
$\iota(\varepsilon) = 1$, $\iota(u) = 0$ otherwise; $\tau(u) = u^{-1}p(\varepsilon)$, (ii)
$\varphi(u, x, ux) = u^{-1}p(x\Sigma^*)$ if $u, ux \in Q_p$ and $x \in \Sigma$, (iii)
$\varphi(u, x, v) = \alpha_v u^{-1}p(x\Sigma^*)$ if $x \in \Sigma$,
$ux \in (Q_p\Sigma \setminus Q_p) \cap res(p)$ and
$(ux)^{-1}p = \sum_{v \in Q_p} \alpha_v v^{-1}p$. It can be shown that $A$ is a reduced
prefix-closed MA which computes $p$ and such that $\rho(A) < 1$. $A$ is called the
canonical representation of $p$. Note that the parameters of $A$ correspond to natural
components of the residuals of $p$ and can be estimated by using samples of $p$.

2.2 Inference of rational stochastic languages 

The algorithm DEES [6] is able to identify rational stochastic languages: with
probability one, for every rational stochastic language $p$ and every infinite sample $S$
of $p$, there exists an integer $N$ such that for every $n > N$, DEES($S_n$) outputs the
canonical representation $A$ of $p$. Before presenting it, we informally introduce the
basic idea of the algorithm. First, the goal is to find the structure of the automaton,
i.e. the set of states $Q_p$, the smallest basis of $[Res(p)]$. The inference proceeds as
follows: the algorithm begins by building a unique state which corresponds to the
residual $\varepsilon^{-1}p_S$. Each state of the automaton corresponds to some residual
$u^{-1}p_S$ where $u$ is the prefix of some examples in $S$. After having built a state
corresponding to $u^{-1}p_S$, for any letter $x$, the algorithm studies the possibility of
adding a new state corresponding to $(ux)^{-1}p_S$ or of creating transitions labeled by
$x$ that lead to the states already built in the automaton. A new state is added to the
automaton if the residual language corresponding to $(ux)^{-1}p_S$ cannot be approximated
as a linear combination of the residual languages corresponding to the states already
built.

The pseudo-code of the algorithm is presented in Algorithm 1. In order to find a linear
combination, DEES uses the following set of inequalities, where $S$ is a non empty finite
sample of $\Sigma^*$, $Q$ a prefix-closed subset of $pref(S)$, $v \in pref(S) \setminus Q$,
and $\epsilon > 0$:

$$I(Q, v, S, \epsilon) = \Big\{ \Big| v^{-1}p_S(w\Sigma^*) - \sum_{u \in Q} X_u\, u^{-1}p_S(w\Sigma^*) \Big| \le \epsilon \;\Big|\; w \in fact(S) \Big\} \cup \Big\{ \sum_{u \in Q} X_u = 1 \Big\}.$$

DEES runs in polynomial time in the size of $S$ and identifies in the limit the structure
of the canonical representation $A$ of the target $p$. Once the correct structure of $A$
is found, the algorithm computes estimates $\alpha_S$ of each parameter $\alpha$ of $A$
such that $|\alpha - \alpha_S| = O(|S|^{-1/3})$. The output automaton $A$ computes a
rational series $r_A$ such that $\sum_{w \in \Sigma^*} r_A(w)$ converges absolutely to 1.
Moreover, it can be shown that $r_A$ converges to the target $p$ under the $D_1$ distance
(also called the $L_1$ norm), which is stronger than the distances $D_2$ or $D_\infty$:
$\sum_{w \in \Sigma^*} |r_A(w) - p(w)|$ tends to 0 when the size of $S$ tends to $\infty$.
If the parameters of $A$ are rational numbers, a variant of DEES can identify exactly the
target [6].

We now give a simple example that illustrates DEES. Let us consider a sample
$S = \{\varepsilon, a, aa, aaa\}$ with multiplicities $|\varepsilon| = 10$,
$|a| = |aa| = 20$ and $|aaa| = 10$. We have the following values for the empirical
distribution: $P_S(\varepsilon) = P_S(aaa) = P_S(aaa\Sigma^*) = 1/6$,
$P_S(a) = P_S(aa) = 1/3$, $P_S(a\Sigma^*) = 5/6$, $P_S(aa\Sigma^*) = 1/2$ and
$P_S(aaaa\Sigma^*) = 0$, with $\epsilon = 60^{-1/3} \approx 0.255$. With the sample $S$,
DEES will infer a multiplicity automaton in three steps:

Input: a sample $S$
Output: a prefix-closed reduced MA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$

$Q \leftarrow \{\varepsilon\}$; $\iota(\varepsilon) \leftarrow 1$; $\tau(\varepsilon) \leftarrow P_S(\varepsilon)$;
$F \leftarrow \Sigma \cap pref(S)$ /* F is the frontier set */;
while $F \neq \emptyset$ do
    $v \leftarrow Min\,F$ s.t. $v = u.x$ where $u \in \Sigma^*$ and $x \in \Sigma$;
    $F \leftarrow F \setminus \{v\}$;
    if $I(Q, v, S, |S|^{-1/3})$ has no solution then
        $Q \leftarrow Q \cup \{v\}$; $\iota(v) \leftarrow 0$; $\tau(v) \leftarrow P_S(v)/P_S(v\Sigma^*)$;
        $\varphi(u, x, v) \leftarrow P_S(v\Sigma^*)/P_S(u\Sigma^*)$; $F \leftarrow F \cup \{vx \in res(P_S) \mid x \in \Sigma\}$;
    else
        let $(\alpha_w)_{w \in Q}$ be a solution of $I(Q, v, S, |S|^{-1/3})$;
        foreach $w \in Q$ do $\varphi(u, x, w) \leftarrow \alpha_w P_S(v\Sigma^*)/P_S(u\Sigma^*)$;

Algorithm 1: Algorithm DEES.



Fig. 1. Illustration of the different steps of algorithm DEES: (a) initialisation with
$\varepsilon$; (b) creation of a new state; (c) final automaton.



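1. We begin by constructing a state for $\varepsilon$ (Figure 1(a)).

2. We examine $P_S(v\Sigma^*)$ with $v = \varepsilon a$ to decide if we need to add a new
   state for the string $a$. We obtain the following system, which in fact has no
   solution, and we create a new state as shown in Figure 1(b).

   $$\left\{ \begin{array}{l}
   \left| \frac{P_S(va\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(a\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon \right| \le \epsilon, \\
   \left| \frac{P_S(vaa\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(aa\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon \right| \le \epsilon, \\
   \left| \frac{P_S(vaaa\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(aaa\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon \right| \le \epsilon, \\
   X_\varepsilon = 1
   \end{array} \right\}$$

   Indeed, with $X_\varepsilon = 1$ the second inequality reads
   $|1/5 - 1/2| = 0.3 > 0.255$.

3. We examine $P_S(v\Sigma^*)$ with $v = aa$ to decide if we need to create a new state for
   the string $aa$. We obtain the system below. It is easy to see that this system admits
   at least one solution, $X_\varepsilon = -1/2$ and $X_a = 3/2$. Then, we add two
   transitions to the automaton, we obtain the automaton of Figure 1(c), and the algorithm
   halts.

   $$\left\{ \begin{array}{l}
   \left| \frac{P_S(va\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(a\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon - \frac{P_S(aa\Sigma^*)}{P_S(a\Sigma^*)} X_a \right| \le \epsilon, \\
   \left| \frac{P_S(vaa\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(aa\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon - \frac{P_S(aaa\Sigma^*)}{P_S(a\Sigma^*)} X_a \right| \le \epsilon, \\
   \left| \frac{P_S(vaaa\Sigma^*)}{P_S(v\Sigma^*)} - \frac{P_S(aaa\Sigma^*)}{P_S(\Sigma^*)} X_\varepsilon - \frac{P_S(aaaa\Sigma^*)}{P_S(a\Sigma^*)} X_a \right| \le \epsilon, \\
   X_\varepsilon + X_a = 1
   \end{array} \right\}$$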
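The two systems of this example can be checked numerically. The Python sketch below does not solve $I(Q, v, S, \epsilon)$ (that would require a linear programming routine); it only verifies whether a given candidate assignment of the $X_u$ satisfies every constraint. The encoding of the sample as a `Counter` and the names `prefix_mass`, `factors` and `satisfies` are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction


def prefix_mass(sample, u):
    # P_S(u Sigma^*): relative frequency of the words having u as a prefix
    total = sum(sample.values())
    return Fraction(sum(c for w, c in sample.items() if w.startswith(u)), total)


def factors(sample):
    # fact(S): all contiguous substrings of the words of the sample
    return {w[i:j] for w in sample for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def satisfies(sample, Q, v, X, eps):
    """Check whether the candidate coefficients X (dict u -> X_u) satisfy I(Q, v, S, eps)."""
    if sum(X[u] for u in Q) != 1:
        return False
    for w in factors(sample):
        lhs = prefix_mass(sample, v + w) / prefix_mass(sample, v)
        rhs = sum(X[u] * prefix_mass(sample, u + w) / prefix_mass(sample, u) for u in Q)
        if abs(lhs - rhs) > eps:
            return False
    return True


S = Counter({"": 10, "a": 20, "aa": 20, "aaa": 10})
eps = sum(S.values()) ** (-1 / 3)  # |S|^{-1/3}, about 0.255

# Step 2: Q = {epsilon}, v = a; the only admissible candidate is X_epsilon = 1
print(satisfies(S, [""], "a", {"": Fraction(1)}, eps))
# Step 3: Q = {epsilon, a}, v = aa, candidate X_epsilon = -1/2, X_a = 3/2
print(satisfies(S, ["", "a"], "aa", {"": Fraction(-1, 2), "a": Fraction(3, 2)}, eps))
```

In step 2 the constraint $X_\varepsilon = 1$ leaves a single candidate, so its failure shows that the system has no solution.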



Since no recursively enumerable subset of MA is capable of generating the set of rational
stochastic languages, no identification algorithm can be proper. This remark applies to
DEES: there is no guarantee at any step that the automaton $A$ output by DEES computes a
stochastic language. However, the rational series $r$ computed by the MA output by DEES
can be used to compute a stochastic language $p_r$ that also converges to the target [6].
Moreover, these series have several nice properties which make them close to stochastic
languages. We call them pseudo-stochastic rational languages and we study their
properties in the next section.

3 Pseudo-stochastic rational languages 

The canonical representation $A$ of a rational stochastic language satisfies
$\rho(A) < 1$ and $\sum_{w \in \Sigma^*} r_A(w) = 1$. We use this characteristic to define
the notion of pseudo-stochastic rational language.

Definition 1. We say that a rational series $r$ is a pseudo-stochastic language if there
exists an MA $A$ which computes $r$ and such that $\rho(A) < 1$ and if $r(\Sigma^*) = 1$.

Note that the condition $\rho(A) < 1$ implies that $r(\Sigma^*)$ is defined without
ambiguity. A rational stochastic language is a pseudo-stochastic rational language but
the converse is false.

Example. Let $A = \langle \Sigma, \{q_0\}, \varphi, \iota, \tau \rangle$ be defined by
$\Sigma = \{a, b\}$, $\iota(q_0) = \tau(q_0) = 1$, $\varphi(q_0, a, q_0) = 1$ and
$\varphi(q_0, b, q_0) = -1$. We have $r_A(u) = (-1)^{|u|_b}$. Check that $\rho(A) = 0$ and
$r_A(u\Sigma^*) = (-1)^{|u|_b}$ for every word $u$. Hence, $r_A$ is a pseudo-stochastic
language.
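The claim $r_A(u\Sigma^*) = (-1)^{|u|_b}$ can be checked by brute force on the partial sums $r_A(u\Sigma^{\le n})$. The Python sketch below relies on the closed form $r_A(u) = (-1)^{|u|_b}$ stated in the example; the function names `r` and `partial_mass` are hypothetical.

```python
from itertools import product


def r(u):
    # r_A(u) = (-1)^{|u|_b} for the one-state MA of the example
    return (-1) ** u.count("b")


def partial_mass(u, n):
    # r_A(u Sigma^{<=n}): sum of r_A(uw) over all words w of length at most n
    return sum(r(u + "".join(w)) for k in range(n + 1) for w in product("ab", repeat=k))


for u in ["", "a", "ab", "abb"]:
    print(repr(u), partial_mass(u, 6), r(u))
```

For every $k \ge 1$ the words of length $k$ contribute $(1 - 1)^k = 0$, so the partial sums are constant in $n$, which is consistent with $\rho(A) = 0$.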

As indicated in the previous section, any canonical representation $A$ of a rational
stochastic language satisfies $\rho(A) < 1$. In fact, the next Lemma shows that any
reduced representation $A$ of a pseudo-stochastic language satisfies $\rho(A) < 1$.

Lemma 1. Let $A$ be a reduced representation of a pseudo-stochastic language. Then,
$\rho(A) < 1$.

Proof. The proof is detailed in Annex 6.1.

Proposition 1. It is decidable within polynomial time whether a given MA computes a
pseudo-stochastic language.

Proof. Given an MA $B$, compute a reduced representation $A$ of $B$, check whether
$\rho(A) < 1$ and then, compute $r_A(\Sigma^*)$. □

It has been shown in [6] that a stochastic language $p_r$ can be associated with a
pseudo-stochastic rational language $r$: the idea is to prune in $\Sigma^*$ all subsets
$u\Sigma^*$ such that $r(u\Sigma^*) \le 0$ and to normalize in order to obtain a
stochastic language. Let $N$ be the smallest prefix-closed subset of $\Sigma^*$ satisfying

$$\varepsilon \in N \text{ and } \forall u \in N, x \in \Sigma,\ ux \in N \text{ iff } r(ux\Sigma^*) > 0.$$

For every $u \in \Sigma^* \setminus N$, define $p_r(u) = 0$. For every $u \in N$, let
$\lambda_u = Max(r(u), 0) + \sum_{x \in \Sigma} Max(r(ux\Sigma^*), 0)$. Then, define
$p_r(u) = Max(r(u), 0)/\lambda_u$. In Algorithm 2, this quotient is weighted by
$p_r(u\Sigma^*)$, the mass accumulated by the same normalization along the prefixes of
$u$. It can be shown (see [6]) that $r(u) \le 0 \Rightarrow p_r(u) = 0$ and
$r(u) \ge 0 \Rightarrow r(u) \ge p_r(u)$.

The difference between $r$ and $p_r$ is simple to express when the sum
$\sum_{u \in \Sigma^*} r(u)$ converges absolutely. Let $N_r = \sum_{r(u) < 0} |r(u)|$. We
have

$$\sum_{u \in \Sigma^*} |r(u) - p_r(u)| = N_r + \sum_{r(u) \ge 0} (r(u) - p_r(u)) = 2N_r + \sum_{u \in \Sigma^*} (r(u) - p_r(u)) = 2N_r.$$

Note that when $r$ is a stochastic language, $\sum_{u \in \Sigma^*} r(u)$ converges
absolutely and $N_r = 0$. As a consequence, in that case, $p_r = r$. We give in Algorithm
2 an algorithm that computes $p_r(u)$ and $p_r(u\Sigma^*)$ for any word $u$ from any MA
that computes $r$. This algorithm is linear in the length of the input. It can be
slightly modified to generate a word drawn according to $p_r$ (see Annex 6.3).

Input: MA $A = \langle \Sigma, Q = \{q_1, \ldots, q_n\}, \varphi, \iota, \tau \rangle$ s.t. $\rho(A) < 1$ and $r_A(\Sigma^*) = 1$;
       a word $u$
Output: $p_{r_A}(u)$, $p_{r_A}(u\Sigma^*)$

for $i = 1, \ldots, n$ do /* this step is polynomial in n and is done once */
    $s_i \leftarrow r_{A,q_i}(\Sigma^*)$; $e_i \leftarrow \iota(q_i)$;
$w \leftarrow \varepsilon$; $\lambda \leftarrow 1$ /* $\lambda$ is equal to $p_{r_A}(w\Sigma^*)$ */;
repeat
    $\mu \leftarrow \sum_{i=1}^{n} e_i \tau(q_i)$; $S \leftarrow \{(w, Max(\mu, 0))\}$;
    for $x \in \Sigma$ do
        $\mu \leftarrow \sum_{i,j=1}^{n} e_i \varphi(q_i, x, q_j) s_j$; $S \leftarrow S \cup \{(wx, Max(\mu, 0))\}$;
    $\sigma \leftarrow \sum_{(x, \mu) \in S} \mu$; $S \leftarrow \{(x, \mu/\sigma) \mid (x, \mu) \in S\}$ /* normalization */;
    if $w = u$ then $p_{r_A}(u) \leftarrow \lambda\mu$ /* where $(u, \mu) \in S$ and $\lambda = p_{r_A}(u\Sigma^*)$ */;
    else
        let $x \in \Sigma$ s.t. $wx$ is a prefix of $u$ and let $\mu$ s.t. $(wx, \mu) \in S$;
        $w \leftarrow wx$; $\lambda \leftarrow \lambda\mu$; for $i = 1, \ldots, n$ do $e_i \leftarrow \sum_{j=1}^{n} e_j \varphi(q_j, x, q_i)$;
    end
until $p_{r_A}(u)$ has been computed;

Algorithm 2: Algorithm computing $p_r$.

Fig. 2. An example of pseudo-stochastic rational languages which are not rational.
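Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the MA is encoded by lists `iota`, `tau` and a dictionary `phi` with `phi[x][i][j]` standing for $\varphi(q_i, x, q_j)$, the values $s_i = r_{A,q_i}(\Sigma^*)$ are assumed to be precomputed and passed as `s`, and the function name `p_r` is hypothetical. The sketch stops early with value 0 as soon as the word leaves the set $N$.

```python
from fractions import Fraction


def p_r(iota, phi, tau, s, u):
    """Return (p_r(u), p_r(u Sigma^*)) using the step-by-step normalization of Algorithm 2."""
    n = len(iota)
    e = [Fraction(v) for v in iota]  # e[i] = sum_q iota(q) phi(q, w, q_i) for the prefix w
    lam = Fraction(1)                # lam = p_r(w Sigma^*)
    for k in range(len(u) + 1):
        if lam == 0:                 # w is outside N: p_r vanishes on w Sigma^*
            return Fraction(0), Fraction(0)
        stop = max(sum(e[i] * tau[i] for i in range(n)), 0)  # Max(r(w), 0)
        cont = {                                             # Max(r(wx Sigma^*), 0)
            x: max(sum(e[i] * phi[x][i][j] * s[j] for i in range(n) for j in range(n)), 0)
            for x in phi
        }
        sigma = stop + sum(cont.values())                    # normalization constant
        if k == len(u):
            return lam * stop / sigma, lam
        x = u[k]
        lam = lam * cont[x] / sigma
        e = [sum(e[i] * phi[x][i][j] for i in range(n)) for j in range(n)]


# One-state MA of the example of Section 3: r_A(u) = (-1)^{|u|_b}, s = [1]
iota, tau, s = [1], [1], [1]
phi = {"a": [[1]], "b": [[-1]]}
for word in ["", "a", "aa", "b", "ab"]:
    print(repr(word), p_r(iota, phi, tau, s, word))
```

On this example every word containing a $b$ is pruned, and the remaining mass is spread over the words $a^n$ with $p_r(a^n) = 2^{-(n+1)}$.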

The stochastic languages $p_r$ associated with pseudo-stochastic rational languages $r$
may fail to be rational.

Proposition 2. There exist pseudo-stochastic rational languages $r$ such that $p_r$ is
not rational.

Proof. Suppose that the parameters of the automaton $A$ described on Figure 2 (state
$q_1$ with loops labelled $a, \rho\alpha$ and $b, \rho$; state $q_2$ with loops labelled
$a, \rho$ and $b, \rho\beta$) satisfy $\rho(\alpha + 1) + \tau_1 = 1$ and
$\rho(\beta + 1) + \tau_2 = 1$ with $\alpha > \beta > 1$. Then the series $r_{q_1}$ and
$r_{q_2}$ are rational stochastic languages and therefore,
$r_A = 3r_{q_1}/2 - r_{q_2}/2$ is a rational series which satisfies
$\sum_{u \in \Sigma^*} |r_A(u)| \le 2$ and $\sum_{u \in \Sigma^*} r_A(u) = 1$.

Let us show that $p_{r_A}$ is not rational. For any $u \in \Sigma^*$,
$r_A(u) = \frac{\rho^{|u|}}{2}\left(3\alpha^{|u|_a}\tau_1 - \beta^{|u|_b}\tau_2\right)$.
For any integer $n$, there exists an integer $m_n$ such that for any integer $i$,
$r_A(a^n b^i) > 0$ iff $i < m_n$. Moreover, it is clear that $m_n$ tends to infinity with
$n$. Suppose now that $p_{r_A}$ is rational and let $L$ be its support. From the Pumping
Lemma, there exists an integer $N$ such that for any word $w = uv \in L$ satisfying
$|v| \ge N$, there exist $v_1, v_2, v_3$ such that $v = v_1 v_2 v_3$ and
$L \cap u v_1 v_2^* v_3$ is infinite. Let $n$ be such that $m_n > N$ and let $u = a^n$ and
$v = b^{m_n - 1}$. Since $w = uv \in L$, $L \cap a^n b^*$ should be infinite, which is
false. Therefore, $L$ is not the support of a rational series. □



Different rational series may yield the same stochastic language. Is it decidable
whether two pseudo-stochastic rational series define the same stochastic language?
Unfortunately, the answer is no. The proof relies on the following result: it is
undecidable whether a multiplicity automaton $A$ over $\Sigma$ satisfies $r_A(u) \le 0$
for every $u \in \Sigma^*$ [8]. It is easy to show that this result still holds for the
set of MA $A$ which satisfy $|r_A(u)| \le \lambda^{|u|}$, for any $\lambda > 0$.

Fig. 3. $A_\alpha$ defines a stochastic language which can be represented by a PA with at
least $2n$ states when $\alpha = \pi/n$. With $\lambda_0 = \lambda_2 = 1$ and
$\lambda_1 = 0$, the MA $A_{\pi/6}$ defines a stochastic language $P$ whose prefixed
reduced representation is the MA $B$ (with approximate values on transitions). In fact,
$P$ can be computed by a PDA and the smallest PA computing it is $C$.

Proposition 3. It is undecidable whether two rational series define the same stochastic 
language. 

Proof. The proof is detailed in Annex 6.2. 

4 Experiments 

In this section, we present a set of experiments that study the performance of the
algorithm DEES for learning good stochastic language models. We study the behavior of
DEES with samples of distributions generated from PDA, PA and non rational stochastic
languages. We compare DEES to the best-known probabilistic grammatical inference
approaches: the algorithms Alergia [3] and MDI [9], which are able to identify PDA. Each
of these algorithms is tuned by a parameter. In the experiments, we choose the parameter
value that gives the best result over all the samples, and we do not change it with the
size of the sample, so that the impact of the sample size remains visible.

In our experiments, we use two performance criteria. We measure the size of the inferred
models by the number of states. Moreover, to evaluate the quality of the automata, we use
the $D_1$ norm¹ between two models $A$ and $A'$ defined by:

$$D_1(A, A') = \sum_{u \in \Sigma^*} |P_A(u) - P_{A'}(u)|.$$

The $D_1$ norm is the strongest distance after Kullback-Leibler. In practice, we use an
approximation by considering a subset of $\Sigma^*$ generated by $A$ ($A$ will be the
target for us).
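A minimal Python sketch of such an approximation is given below. It assumes that both models are available as functions returning the value assigned to a word, and that the finite set of words has already been generated from the target; the names `d1_on_support`, `target` and `model` and the two toy distributions are hypothetical and are not taken from the experiments.

```python
from fractions import Fraction


def d1_on_support(p_target, p_model, words):
    """Approximate D1 by summing |P_A(u) - P_A'(u)| over a finite set of words."""
    return sum(abs(p_target(u) - p_model(u)) for u in set(words))


def target(u):
    # toy target over {a}: P(a^n) = 2^{-(n+1)}
    return Fraction(1, 2 ** (len(u) + 1)) if set(u) <= {"a"} else Fraction(0)


def model(u):
    # toy learned model over {a}: P'(a^n) = (1/3) (2/3)^n
    return Fraction(1, 3) * Fraction(2, 3) ** len(u) if set(u) <= {"a"} else Fraction(0)


# words that would be drawn from the target; duplicates are counted once
words = ["", "a", "a", "aa"]
print(d1_on_support(target, model, words))
```

Because the sum runs over a finite subset only, the result is a lower bound on the true $D_1$ distance; the absolute value also makes it usable when the learned model assigns negative values to some strings.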



¹ Note that we cannot use the Kullback-Leibler measure because it is not robust to
strings of null probability, which would require smoothing the learned models, and also
because automata produced by DEES do not always define stochastic languages, i.e. some
strings may have a negative value.

Fig. 4. Results obtained with the prefix reduced multiplicity automaton of three states
of Figure 3, admitting a representation with a PDA of twelve states. Panel (b): size of
the model; x-axis: size of the learning sample.

Fig. 5. Automaton A is a PA with non rational parameters in $\mathbb{R}^+$ ([…] 1)/2).
A can be represented by an MA B with rational parameters in $\mathbb{Q}$ [5].



We carried out a first series of experiments where the target automaton can be
represented by a PDA. We consider a stochastic language defined by the automaton of
Figure 3. This stochastic language can be represented by a multiplicity automaton of
three states and by an equivalent minimal PDA of twelve states [6] (Alergia and MDI can
then identify this automaton). To compare the performances of the three algorithms, we
used the following experimental setup. From the target automaton, we generate samples of
size 100 to 10000. Then, for each sample, we learn an automaton with each of the three
algorithms and compute the $D_1$ norm between it and the target. We repeat this
experimental setup 10 times and give the average results. Figure 4 reports the results
obtained. If we consider the size of the learned models, DEES quickly finds the target
automaton, while MDI only begins to tend to the target PDA after 10000 examples. The
automata produced by Alergia are far from this target. This behavior can be explained by
the fact that these two algorithms need significantly longer examples, and thus larger
samples, to find the correct target; the effect is amplified because there are more
parameters to estimate. In practice, we noticed that the correct structure can be found
after more than 100000 examples. If we look at the distance $D_1$, DEES outperforms MDI
and Alergia (which have the same behavior) and begins to converge after 500 examples.

We carried out another series of experiments to evaluate DEES when the target belongs to
the class of PA. First, we consider the simple automaton of Figure 5, which defines a
stochastic language that can be represented by a PA with parameters in $\mathbb{R}^+$. We
follow the same experimental setup as in the first experiment; the results are reported
on Figure 6. According to our two performance criteria, DEES again outperforms Alergia
and MDI. In fact, the target cannot be modeled correctly by Alergia and MDI because it
cannot be represented by a PDA. This explains why these algorithms cannot find a good
model. For them, the best answer is to produce a unigram model. Alergia even diverges at
a given step (this behavior is due to its fusion criterion, which becomes more
restrictive as the learning set grows) and MDI always returns the unigram. DEES finds the
correct structure quickly and begins to converge after 1000 examples. This behavior
confirms that DEES can produce better models with small samples because it constructs
small representations. On the other hand, Alergia and MDI seem to need a huge number of
examples to find a good approximation of the target, even when the target is relatively
small.

Fig. 6. Results obtained with the target automaton of Figure 5 admitting a representation
in the class PA with non rational parameters.

Fig. 7. Results obtained from a set of PA generated randomly.

We made another experiment in the class of PA. We study the behavior of DEES 
when the learning samples are generated from different targets randomly generated. For 
this experiment, we take an alphabet of three letters and we generate randomly some 
PA with a number of states from 2 to 25. The PA are generated in order to have a prefix 
representation which guarantees that all the states are reachable. The rest of the tran- 
sitions and the values of the parameters are chosen randomly. Then, for each target, 
we generate 5 samples of size 300 times the number of states of the target. We made 
this choice because we think that for small targets the samples may be sufficient to find 
a good approximation, while for bigger targets there is a clear lack of examples. This 
last point allows us to see the behaviors of the algorithms with small amounts of data. 
We learn an automaton from each sample and compare it to the corresponding target. 
Note that we did not use MDI in this experiment because this algorithm is extremely 
hard to tune, so finding a good parameter is very costly in time. The parameter of 
Alergia is fixed to a reasonable value, kept for the whole experiment. Results for 
Alergia and DEES are reported in Figure 7. We also add the empirical distance of the 
samples to the target automaton. Under the D1 norm, the performance of Alergia depends 
highly on the empirical distribution: Alergia infers models close to, or better than, 
those produced by DEES only when the empirical distribution is already very good, that 
is, when learning is not necessary. Moreover, Alergia has a greater variance, which 
implies a weak robustness. On the other hand, DEES always learns significantly smaller 
models, which are almost always better, even with small samples. 

Fig. 8. Results obtained with samples generated from a non rational stochastic 
language: (a) results with distance D1; (b) size of the model. 
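
The target generation described in this experiment can be sketched as follows. This is only an illustration under stated assumptions, not the authors' generator: the function names `random_pa` and `draw_word`, the number of extra transitions, and the way weights are drawn are all hypothetical. Reachability is obtained by first building a tree of transitions rooted at the initial state, in the spirit of the prefix representation mentioned above.

```python
import random

def random_pa(n_states, alphabet, rng):
    """Random probabilistic automaton whose states are all reachable.

    State 0 is the single initial state. Returns (trans, stop) where
    trans[q] is a list of (letter, next_state, probability) and stop[q]
    is the termination probability of state q.
    """
    edges = set()
    # Tree of transitions: every state q > 0 gets an incoming edge from an
    # earlier state, so every state is reachable from state 0.
    for q in range(1, n_states):
        edges.add((rng.randrange(q), rng.choice(alphabet), q))
    # Extra random transitions (their number is an arbitrary assumption).
    for _ in range(n_states):
        edges.add((rng.randrange(n_states), rng.choice(alphabet),
                   rng.randrange(n_states)))
    trans, stop = {}, {}
    for q in range(n_states):
        out = sorted(e for e in edges if e[0] == q)
        # One weight for stopping plus one per outgoing edge, then normalize.
        weights = [rng.random() + 0.1 for _ in range(len(out) + 1)]
        total = sum(weights)
        stop[q] = weights[0] / total
        trans[q] = [(a, nxt, w / total)
                    for (_, a, nxt), w in zip(out, weights[1:])]
    return trans, stop

def draw_word(trans, stop, rng):
    """Draw one word by walking the automaton from state 0."""
    q, word = 0, []
    while True:
        r, acc = rng.random(), stop[q]
        if r < acc:
            return "".join(word)
        for a, nxt, p in trans[q]:
            acc += p
            if r < acc:
                word.append(a)
                q = nxt
                break
        else:
            # Only reached through floating-point round-off.
            return "".join(word)

rng = random.Random(0)
n_states = 5
trans, stop = random_pa(n_states, "abc", rng)
# Sample size follows the text: 300 times the number of states.
sample = [draw_word(trans, stop, rng) for _ in range(300 * n_states)]
```

Every state keeps a strictly positive stopping probability in this sketch, so each draw terminates with probability 1.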

Finally, we carried out a last experiment whose objective is to study the behavior 
of the three algorithms with samples generated from a non rational stochastic language. 
We consider, as a target, the stochastic language generated using the $p_r$ algorithm 
from the automaton of Figure 2 (note that this automaton admits a prefix reduced 
representation of 2 states). We took $\rho = 3/10$, $\alpha = 3/2$ and $\beta = 5/4$. 
We follow the same experimental setup as in the first experiment. Since we use rational 
representations, we measure the distance D1 from the automaton of Figure 2 using a 
sample generated by $p_r$ (i.e. we measure D1 only for strings with a strictly positive 
value). The results are presented in Figure 8. MDI and Alergia are clearly not able to 
build a good estimate of the target distribution, and their best answer is to produce 
a unigram. On the other hand, DEES is able to identify a structure close to the MA that 
was used for defining the distribution and produces good automata after 2000 examples. 
This suggests that DEES is able to produce pseudo-stochastic rational languages which 
are close to a non rational stochastic distribution. 

5 Conclusion 

In this paper, we studied the class of pseudo-stochastic rational languages (PSRL), that 
is, stochastic languages defined by multiplicity automata which do not themselves define 
stochastic languages but share some properties with them. We showed that it is possible 
to decide whether an MA defines a PSRL, but that we cannot decide whether two MA define 
the same PSRL. Moreover, it is possible to define a stochastic language from these MA, 
but this language is not rational in general. Despite these drawbacks, we showed 
experimentally that DEES produces MA computing pseudo-stochastic rational languages that 
provide good estimates of a target stochastic language. We recall here that DEES is able 
to output automata with a minimal number of parameters, which is clearly an advantage 
from a machine learning standpoint, especially for dealing with small datasets. Moreover, 
our experiments showed that DEES outperforms standard probabilistic grammatical 
inference approaches. Thus, we think that the class of pseudo-stochastic rational 
languages is promising for many applications in grammatical inference. Beyond 
continuing the study of this class, we also plan to consider methods that could 
infer a class of MA strictly greater than the class of PSRL. We have also begun to 
adapt the approaches presented in this paper to trees. 

References 

1. J. Berstel and C. Reutenauer. Les séries rationnelles et leurs langages. Masson, 1984. 

2. V. D. Blondel and J. N. Tsitsiklis. A survey of computational complexity results in 
systems and control. Automatica, 36(9):1249-1274, September 2000. 

3. R. C. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state 
merging method. In Proceedings of ICGI'94, LNAI, pages 139-150. Springer, 1994. 

4. F. Denis and Y. Esposito. Learning classes of probabilistic automata. In Proceedings 
of COLT'04, volume 3120 of LNCS, pages 124-139. Springer, 2004. 

5. F. Denis and Y. Esposito. Rational stochastic language. Technical report, LIF - 
Université de Provence, 2006. 

6. F. Denis, Y. Esposito, and A. Habrard. Learning rational stochastic languages. In 
Proceedings of COLT'06, 2006. 

7. F. R. Gantmacher. Théorie des matrices, tomes 1 et 2. Dunod, 1966. 

8. A. Salomaa and M. Soittola. Automata: Theoretic Aspects of Formal Power Series. 
Springer-Verlag, 1978. 

9. F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using 
Kullback-Leibler divergence and minimality. In Proceedings of ICML'00, pages 975-982, 
June 2000. 






6 Annex 

6.1 Proof of Lemma 1 

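Lemma 1. Let $A$ be a reduced representation of a pseudo-stochastic language. Then, 
$\rho(A) < 1$. 

Proof (sketch). Let $A = \langle \Sigma, Q_A, \varphi_A, \iota_A, \tau_A \rangle$ be a 
reduced representation of $r$ and let $B = \langle \Sigma, Q_B, \varphi_B, \iota_B, 
\tau_B \rangle$ be an MA that computes $r$ and such that $\rho(B) < 1$. Since $A$ is 
reduced, the vector subspace $E$ of $\mathbb{R}\langle\langle\Sigma\rangle\rangle$ 
spanned by $\{r_{A,q} \mid q \in Q_A\}$ is equal to $[\{\dot{u}r \mid u \in \Sigma^*\}]$ 
and is contained in the vector subspace $F$ spanned by $\{r_{B,q} \mid q \in Q_B\}$. 

The set $\{r_{A,q} \mid q \in Q_A\}$ is a basis of $E$. Let us complete it into a basis 
of $F$ and let $P_E$ be the corresponding projection defined from $F$ over $E$. Note 
that for any $x \in \Sigma$ and any $r \in F$, we have $P_E(\dot{x}r) = \dot{x}P_E(r)$. 

For any state $q \in Q_B$, let us express $P_E(r_{B,q})$ in this basis: 

\[ P_E(r_{B,q}) = \sum_{q' \in Q_A} \lambda_{q,q'} \, r_{A,q'}. \] 

Note that for any MA $C$ and any state $q$ of $C$, 

\[ \sum_{x \in \Sigma} \dot{x} r_{C,q} = \sum_{q' \in Q_C} \varphi_C(q, \Sigma, q') \, r_{C,q'}. \] 

Therefore, for any state $q$ of $B$, we have 

\[ P_E\Big(\sum_{x \in \Sigma} \dot{x} r_{B,q}\Big) 
 = P_E\Big(\sum_{q' \in Q_B} \varphi_B(q, \Sigma, q') \, r_{B,q'}\Big) 
 = \sum_{q' \in Q_B} \varphi_B(q, \Sigma, q') \sum_{q'' \in Q_A} \lambda_{q',q''} \, r_{A,q''} \] 

but also 

\[ P_E\Big(\sum_{x \in \Sigma} \dot{x} r_{B,q}\Big) 
 = \sum_{x \in \Sigma} \dot{x} P_E(r_{B,q}) 
 = \sum_{x \in \Sigma} \sum_{q' \in Q_A} \lambda_{q,q'} \, \dot{x} r_{A,q'} 
 = \sum_{q' \in Q_A} \sum_{q'' \in Q_A} \lambda_{q,q'} \, \varphi_A(q', \Sigma, q'') \, r_{A,q''} \] 

and therefore 

\[ \sum_{q' \in Q_B} \sum_{q'' \in Q_A} \varphi_B(q, \Sigma, q') \, \lambda_{q',q''} \, r_{A,q''} 
 = \sum_{q' \in Q_A} \sum_{q'' \in Q_A} \lambda_{q,q'} \, \varphi_A(q', \Sigma, q'') \, r_{A,q''}. \] 

Now, let $M_A$ (resp. $M_B$, resp. $\Lambda$) be the matrix indexed by $Q_A \times Q_A$ 
(resp. $Q_B \times Q_B$, resp. $Q_B \times Q_A$) and defined by $M_A[q,q'] = 
\varphi_A(q, \Sigma, q')$ (resp. $M_B[q,q'] = \varphi_B(q, \Sigma, q')$, resp. 
$\Lambda[q,q'] = \lambda_{q,q'}$). Note that the rank of $\Lambda$ is equal to the 
dimension of $E$. Since the $r_{A,q''}$ form a basis of $E$, identifying coefficients 
in the previous equality gives 

\[ M_B \Lambda = \Lambda M_A. \] 

Let $\mu$ be an eigenvalue of $M_A$ and let $X$ be an associated eigenvector. We have 

\[ M_B \Lambda X = \Lambda M_A X = \mu \Lambda X \] 

and since the rank of $\Lambda$ is maximal, $\Lambda X \neq 0$, so $\mu$ is also an 
eigenvalue of $M_B$. Therefore, $\rho(B) < 1$ implies that $\rho(A) < 1$. □ 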
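
The last step of the proof can be checked numerically on a toy instance. The sketch below is only an illustration: the matrices `M_A`, `M_B` and `LAM` (standing for $\Lambda$) are made-up values chosen so that $M_B \Lambda = \Lambda M_A$ holds and $\Lambda$ has full column rank; they do not come from any automaton in the paper.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices of the same shape."""
    return all(abs(x - y) <= tol
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical toy matrices: M_A is 2x2, M_B is 3x3, LAM is 3x2 of rank 2.
M_A = [[0.2, 0.3], [0.1, 0.4]]
LAM = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M_B = [[0.1, 0.2, 0.1], [0.05, 0.35, 0.05], [0.1, 0.5, 0.2]]

# Intertwining relation M_B * LAM = LAM * M_A.
intertwined = close(matmul(M_B, LAM), matmul(LAM, M_A))

# X = (1, 1) is an eigenvector of M_A for the eigenvalue mu = 0.5.
mu = 0.5
X = [[1.0], [1.0]]
x_is_eigenvector = close(matmul(M_A, X), [[mu * row[0]] for row in X])

# LAM * X is nonzero and is an eigenvector of M_B for the same eigenvalue.
LX = matmul(LAM, X)
lx_nonzero = any(abs(row[0]) > 0 for row in LX)
lx_is_eigenvector = close(matmul(M_B, LX), [[mu * row[0]] for row in LX])
```

With these values, every eigenvalue of `M_A` is carried over to `M_B` through `LAM`, which is the mechanism the proof uses to bound $\rho(A)$ by $\rho(B)$.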
6.2 Proof of Proposition 3 

Proposition 3. It is undecidable whether two rational series define the same stochastic 
language. 

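Proof. Let $A = \langle \Sigma, Q, \iota, \varphi, \tau \rangle$ be an MA which 
satisfies $|r_A(u)| \le \lambda^{|u|}$ for some $\lambda < 1/(2|\Sigma|)$. Let 
$\overline{\Sigma} = \{\overline{x} \mid x \in \Sigma\}$ be a disjoint copy of $\Sigma$ 
and let $c$ be a new letter: $c \notin \Sigma \cup \overline{\Sigma}$. Let 
$u \mapsto \overline{u}$ be the morphism inductively defined from $\Sigma^*$ into 
$\overline{\Sigma}^*$ by $\overline{\varepsilon} = \varepsilon$ and 
$\overline{ux} = \overline{u} \cdot \overline{x}$. 

Let $B = \langle \Sigma_B, Q, \iota, \varphi_B, \tau \rangle$ be defined by 
$\Sigma_B = \Sigma \cup \overline{\Sigma} \cup \{c\}$, $\varphi_B(q, c, q') = 1$ if 
$q = q'$ and $0$ otherwise, and $\varphi_B(q, x, q') = \varphi_B(q, \overline{x}, q') = 
\varphi(q, x, q')$ if $x \in \Sigma$. 

Let $f$ be the rational series defined by $f(w) = r_A(uv)$ if $w = u c \overline{v}$ 
for some $u, v \in \Sigma^*$ and $0$ otherwise. 

Let $\rho$ be such that $2\lambda < \rho < 1/|\Sigma|$, let $r$ be the rational series 
defined on $\Sigma_B$ by $r(w) = \rho^{|w|}$ if $w \in \Sigma^*$ and $0$ otherwise. Let 
$g = f + r$. Check that 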



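\[ \sum_{u,v \in \Sigma^*} |f(u c \overline{v})| 
 = \sum_{u,v \in \Sigma^*} |r_A(uv)| 
 \le \sum_{u,v \in \Sigma^*} \lambda^{|uv|} 
 = \Big(\sum_{n \ge 0} (|\Sigma|\lambda)^n\Big)^2 
 = \Big(\frac{1}{1 - |\Sigma|\lambda}\Big)^2. \] 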

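Therefore, the sum $\sum_{w \in \Sigma_B^*} g(w)$ is absolutely convergent. Check also 
that, since $2\lambda < \rho$, 

\[ \sum_{w \in \Sigma_B^*} g(w) 
 \ge \sum_{w \in \Sigma^*} \rho^{|w|} - \sum_{w \in \Sigma_B^*} |f(w)| 
 \ge \frac{1}{1 - |\Sigma|\rho} - \Big(\frac{1}{1 - |\Sigma|\lambda}\Big)^2 > 0. \] 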
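
The two bounds used in this part of the proof can be checked numerically. The sketch below is an illustration under assumptions: the alphabet, the values of `lam` and `rho`, and the function names are hypothetical, and the lower bound on the sum of $g$ is the quantity $1/(1-|\Sigma|\rho) - (1/(1-|\Sigma|\lambda))^2$ obtained from $g = f + r$ and the constraints $\lambda < 1/(2|\Sigma|)$ and $2\lambda < \rho < 1/|\Sigma|$.

```python
from itertools import product

def truncated_pair_sum(alphabet, lam, max_len):
    """Brute-force sum of lam**|uv| over all u, v with |u|, |v| <= max_len."""
    words = [w for n in range(max_len + 1)
             for w in product(alphabet, repeat=n)]
    return sum(lam ** (len(u) + len(v)) for u in words for v in words)

def closed_form(k, lam):
    """Limit of the sum above, (1 / (1 - k*lam))**2, valid when k*lam < 1."""
    return (1.0 / (1.0 - k * lam)) ** 2

def lower_bound_on_total(k, lam, rho):
    """1/(1 - k*rho) - (1/(1 - k*lam))**2, a lower bound on the sum of g."""
    return 1.0 / (1.0 - k * rho) - closed_form(k, lam)

alphabet = "ab"
k = len(alphabet)
# Hypothetical values with lam < 1/(2k) = 0.25 and 2*lam < rho < 1/k = 0.5.
lam, rho = 0.2, 0.45

partial = truncated_pair_sum(alphabet, lam, max_len=6)
bound = closed_form(k, lam)
gap = lower_bound_on_total(k, lam, rho)
```

The truncated sum stays below the closed form, and the gap is strictly positive, which is what makes the normalization of $g$ possible.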

Let $\mu = \big(\sum_{w \in \Sigma_B^*} g(w)\big)^{-1}$ and $h = \mu g$. 

For any u G 27*, = h(ucS%) = h{ucS*) = pr A {uS*) and h(uE B ) = 

h{uE*) + h{ucS*) = M irfei + ^K))- 
Check also that for any « G 27*, 

/iM nl"l «l u l \l" 

l-|27|pM v 7 " 1 - |27|pl«l f^t, l -l-\S\p\ u \ 1-|27|AH 

Therefore, > and h(uxE* B ) > for every u G 27* and any letter x G 27. On 
the other hand, $h(uc\Sigma_B^*) \le 0$ iff $r_A(u\Sigma^*) \le 0$. That is, 
$p_h = p_r$ iff $r_A(u\Sigma^*) \le 0$ for every $u \in \Sigma^*$. An algorithm able to 
decide whether $p_h = p_r$ could be used to decide whether $r_A(u\Sigma^*) \le 0$ for 
every $u \in \Sigma^*$. □ 



6.3 Drawing a word according to $p_r$ 

Algorithm 3 is a modification of Algorithm 2 that draws a word according to the 
distribution $p_r$. 

Input: an MA A = ⟨Σ, Q = {q_1, ..., q_n}, φ, ι, τ⟩ s.t. ρ(A) < 1 and r_A(Σ*) = 1 
Output: a word u drawn according to p_{r_A} 

for i = 1, ..., n do /* this step is polynomial in n and is done once */ 
    s_i ← r_{A,q_i}(Σ*); e_i ← ι(q_i); 
u ← ε; 
finished ← false; 
w ← ε; λ ← 1 /* λ is equal to p_{r_A}(wΣ*) */; 
while not finished do 
    S ← ∅; λ ← 0; 
    μ ← Σ_{i=1..n} e_i τ(q_i); 
    if μ > 0 then 
        S ← {(ε, μ)}; 
        λ ← μ; 
    for x ∈ Σ do 
        v ← Σ_{i,j=1..n} e_i φ(q_i, x, q_j) s_j; 
        if v > 0 then 
            S ← S ∪ {(x, v)}; 
            λ ← λ + v; 
    for (x, v) ∈ S do (x, v) ← (x, v/λ); 
    x ← Draw(S) /* draw randomly an element (x, p) of S with probability p */; 
    if x = ε then 
        finished ← true; 
    else 
        u ← ux; 
        for i = 1, ..., n do e_i ← Σ_{j=1..n} e_j φ(q_j, x, q_i); 

Algorithm 3: Algorithm drawing a word according to the distribution $p_r$. 
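
A minimal Python sketch of the drawing procedure above may help to see how the non-positive weights are discarded. It is an illustration under assumptions, not the authors' implementation: the name `draw_word_pr`, the dictionary layout of `phi` (one n×n matrix per letter), the fixed-point iteration used to compute the values s_i, and the toy automaton are all hypothetical. The normalization by λ is delegated to `random.Random.choices`, which accepts unnormalized weights.

```python
import random

def draw_word_pr(alphabet, phi, iota, tau, rng, fp_iters=1000, max_len=10000):
    """Draw a word, keeping only the strictly positive weights at each step.

    phi[x][i][j] is the weight of the transition (q_i, x, q_j); iota and tau
    are the initial and terminal weight vectors.
    """
    n = len(iota)
    # M[i][j] = sum over letters of phi[x][i][j].
    M = [[sum(phi[x][i][j] for x in alphabet) for j in range(n)]
         for i in range(n)]
    # s_i = r_{A,q_i}(Sigma^*), approximated as the fixed point of
    # s = tau + M s, which converges when the spectral radius of M is < 1.
    s = list(tau)
    for _ in range(fp_iters):
        s = [tau[i] + sum(M[i][j] * s[j] for j in range(n)) for i in range(n)]
    e = list(iota)
    word = []
    while len(word) < max_len:
        mu = sum(e[i] * tau[i] for i in range(n))
        # The empty string plays the role of epsilon (stop here).
        choices = [("", mu)] if mu > 0 else []
        for x in alphabet:
            v = sum(e[i] * phi[x][i][j] * s[j]
                    for i in range(n) for j in range(n))
            if v > 0:
                choices.append((x, v))
        if not choices:
            raise ValueError("no strictly positive weight left to draw from")
        symbols, weights = zip(*choices)
        x = rng.choices(symbols, weights=weights)[0]
        if x == "":
            return "".join(word)
        word.append(x)
        # Forward update: e_i <- sum_j e_j * phi(q_j, x, q_i).
        e = [sum(e[j] * phi[x][j][i] for j in range(n)) for i in range(n)]
    raise RuntimeError("max_len reached before the word ended")

# Toy one-state MA (hypothetical weights): the letter "b" has a negative
# weight, so it is never drawn; the weights satisfy r_A(Sigma^*) = 1.
toy_phi = {"a": [[0.5]], "b": [[-0.1]]}
toy_sample = [draw_word_pr("ab", toy_phi, [1.0], [0.6], random.Random(i))
              for i in range(20)]
```

The `max_len` guard is a safety net added for the sketch only; it is not part of the pseudocode.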